Introduction


Nowadays, as climate change continues to cause sea-level rise, storm surge, and unusual weather conditions, flood events around the world are becoming more and more intense and frequent, or in the worst case, a loved one. As planner try to plan for the uncertainties of climate changes by making sustainable, resilient communities, the ability to predict and analyze flood inundation within an area is crucial. Here, we are trying to look at a city’s certain hydrological, natural, and built-environment features related to the probability of flood inundation. We examined their spatial correlation with flood inundation and used this relationship to further predict future floods as accurately as possible.

Setup

library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.6.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.6.3
library(pscl)
## Warning: package 'pscl' was built under R version 3.6.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(plotROC)
## Warning: package 'plotROC' was built under R version 3.6.3
library(pROC)
## Warning: package 'pROC' was built under R version 3.6.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:plotROC':
## 
##     ggroc
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(sf)
## Warning: package 'sf' was built under R version 3.6.3
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.3.0 --
## √ tibble  3.0.1     √ dplyr   0.8.4
## √ tidyr   1.0.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## √ purrr   0.3.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
library(knitr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.6.3
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

Calgary Flood Prediction

Create Fishnet Cell

To build the prediction model, first, we need to look for possible features within a city related to flood inundation. Calgary is our target city and we have also chosen Toronto as a comparison city as these two cities are similar in size and have both been through flood events in recent years. Since many of the features we want to include are not aggregated at the same level and not in the same forms, we have to create our unit of analysis first. We decided to use a 300m * 300m cell as our basic unit of analysis and created fishnets from it covering the study area in ArcGIS. Then, we summarized attributes from our gathered features as well as flood inundation information into the fishnets using tools like Spatial Join, Reclassify, and Zonal Statistics. Therefore, we have the dataset for Calgary and Toronto that contains both the dependent variable, flood inundation indicator, and the independent variable. To be specific, our independent variables include elevation, distance to steep slopes, distance to water bodies, number of water breaks, percent of impervious surface.



Considering Toronto flood inundation information we collected is in vectors whereas Calgary flood inundation information is in raster, we expect the Toronto flood inundation map will be overshooting some of the flood inundation areas.

Then, we imported our datasets into R and ran logistic regression to determine if those features are related to flood inundation or not and built prediction models from it.

Exploratory Analysis

First, we set the directory and read the data.

setwd("D:/OneDrive - PennO365/2021SpringSemester/CPLN675/HW/Midterm")
dat_cg <- st_read("./Calgary/cg_fishnet.shp")
## Reading layer `cg_fishnet' from data source `D:\OneDrive - PennO365\2021SpringSemester\CPLN675\HW\Midterm\Calgary\cg_fishnet.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8459 features and 12 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 550263.8 ymin: 5630643 xmax: 580563.8 ymax: 5665743
## proj4string:    +proj=tmerc +lat_0=0 +lon_0=-115 +k=0.9992 +x_0=500000 +y_0=0 +datum=NAD83 +units=m +no_defs

Plot the flood inundation areas in Calgary using the fishnet we built in last step.

ggplot() + 
  geom_sf(data=dat_cg, aes(fill=as.factor(inundation)), colour=NA) +
  scale_fill_manual(values = c("#FDE725", "#450D54"),
                    labels = c("No Flood Inundation","Flood Inundation"),
                    name = "") +
  labs(title="Flood inundation areas in Calgary by way of the fishnet") +
  mapTheme()

Take a look at the variables we have in our dataset.

str(dat_cg)
## Classes 'sf' and 'data.frame':   8459 obs. of  13 variables:
##  $ N_WATERBRE: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ inundation: num  1 1 1 0 0 0 0 0 0 0 ...
##  $ ID        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ ELEVATION : num  980 981 979 1089 1073 ...
##  $ DRAINAGE  : num  76 76 76 78 78 76 74 70 70 70 ...
##  $ D_SLOPE   : num  1140 1059 948 0 0 ...
##  $ FLOWLENGTH: num  55.9 120.6 34.6 372.3 209.5 ...
##  $ IPAREA    : num  35297 1742 35149 392 1960 ...
##  $ IPPCT     : num  0.39218 0.01936 0.39054 0.00435 0.02178 ...
##  $ D_RIVER   : num  0 0 0 0 0 ...
##  $ Shape_Leng: num  1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...
##  $ Shape_Area: num  90000 90000 90000 90000 90000 90000 90000 90000 90000 90000 ...
##  $ geometry  :sfc_POLYGON of length 8459; first list element: List of 1
##   ..$ : num [1:5, 1:2] 573364 573364 573664 573664 573364 ...
##   ..- attr(*, "class")= chr  "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr  "N_WATERBRE" "inundation" "ID" "ELEVATION" ...

Convert the data type of inundation varaible to integer.

dat_cg$inundation <- as.integer(dat_cg$inundation)
str(dat_cg)
## Classes 'sf' and 'data.frame':   8459 obs. of  13 variables:
##  $ N_WATERBRE: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ inundation: int  1 1 1 0 0 0 0 0 0 0 ...
##  $ ID        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ ELEVATION : num  980 981 979 1089 1073 ...
##  $ DRAINAGE  : num  76 76 76 78 78 76 74 70 70 70 ...
##  $ D_SLOPE   : num  1140 1059 948 0 0 ...
##  $ FLOWLENGTH: num  55.9 120.6 34.6 372.3 209.5 ...
##  $ IPAREA    : num  35297 1742 35149 392 1960 ...
##  $ IPPCT     : num  0.39218 0.01936 0.39054 0.00435 0.02178 ...
##  $ D_RIVER   : num  0 0 0 0 0 ...
##  $ Shape_Leng: num  1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...
##  $ Shape_Area: num  90000 90000 90000 90000 90000 90000 90000 90000 90000 90000 ...
##  $ geometry  :sfc_POLYGON of length 8459; first list element: List of 1
##   ..$ : num [1:5, 1:2] 573364 573364 573664 573664 573364 ...
##   ..- attr(*, "class")= chr  "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr  "N_WATERBRE" "inundation" "ID" "ELEVATION" ...

Build some bar plots that show differences in our independent variables that are flood inundation areas and not flood inundation areas.

cgPlotVariables <- 
  dat_cg %>%
  as.data.frame() %>%
  select(inundation,ELEVATION,DRAINAGE,D_SLOPE,FLOWLENGTH,IPAREA,IPPCT,D_RIVER,N_WATERBRE) %>%
  gather(key, value, ELEVATION:N_WATERBRE)

ggplot(cgPlotVariables, aes(as.factor(inundation), value, fill=as.factor(inundation))) + 
     geom_bar(stat="identity") + 
     facet_wrap(~key) +
     scale_fill_manual(values = c("darkgreen", "dodgerblue4"),
                      labels = c("No Flood inundation lands","Flood inundation lands"),
                      name = "") +
    labs(x="Flood inundation", y="Value")

Scale the variables so that the differences are apparent.

temp_cgPlotVariables <- 
  cgPlotVariables %>%
  mutate(value = ifelse(key == "FLOWLENGTH", value/10, value)) 

temp_cgPlotVariables <- 
  temp_cgPlotVariables %>%
  mutate(value = ifelse(key == "IPAREA", value/10, value)) 

temp_cgPlotVariables <- 
  temp_cgPlotVariables %>%
  mutate(value = ifelse(key == "DRAINAGE", value*100, value)) 

temp_cgPlotVariables <- 
  temp_cgPlotVariables %>%
  mutate(value = ifelse(key == "IPPCT", value*10000, value)) 

temp_cgPlotVariables <- 
  temp_cgPlotVariables %>%
  mutate(value = ifelse(key == "N_WATERBRE", value*1000, value)) 

ggplot(temp_cgPlotVariables, aes(as.factor(inundation), value, fill=as.factor(inundation))) + 
     geom_bar(stat="identity") + 
     facet_wrap(~key) +
     scale_fill_manual(values = c("darkgreen", "dodgerblue4"),
                      labels = c("No Flood inundation lands","Flood inundation lands"),
                      name = "") +
    labs(x="Flood inundation", y="Value")

Data Wrangling

Select the variables we will use in the prediction model.

dat_cg <- 
  dat_cg %>%
  select(inundation,ELEVATION,DRAINAGE,D_SLOPE,FLOWLENGTH,IPAREA,IPPCT,D_RIVER,N_WATERBRE) 

Normalize the features except the inundation variable.

dat_cg_norm <- dat_cg %>% mutate_at(2:9, funs((.-min(.))/max(.-min(.))))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

Feature Engineering

Check the correlation between the independent variables.

#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
dat_cg_norm.cor = cor(dat_cg_norm %>%
                   as.data.frame() %>%
                   select(-geometry, -inundation))
corrplot(dat_cg_norm.cor)

IPAREA has significant correlation with IPPCT, so we take out one of the variables.

dat_cg_norm <- 
  dat_cg_norm %>%
  select(inundation,ELEVATION,DRAINAGE,D_SLOPE,FLOWLENGTH,IPPCT,D_RIVER,N_WATERBRE) 

Let’s take a look at the variables we selected.

str(dat_cg_norm)
## Classes 'sf' and 'data.frame':   8459 obs. of  9 variables:
##  $ inundation: int  1 1 1 0 0 0 0 0 0 0 ...
##  $ ELEVATION : num  0.0421 0.0433 0.0378 0.3842 0.3353 ...
##  $ DRAINAGE  : num  0.844 0.844 0.844 0.867 0.867 ...
##  $ D_SLOPE   : num  0.117 0.1086 0.0973 0 0 ...
##  $ FLOWLENGTH: num  0.001 0.00216 0.00062 0.00668 0.00376 ...
##  $ IPPCT     : num  0.39218 0.01936 0.39054 0.00435 0.02178 ...
##  $ D_RIVER   : num  0 0 0 0 0 ...
##  $ N_WATERBRE: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ geometry  :sfc_POLYGON of length 8459; first list element: List of 1
##   ..$ : num [1:5, 1:2] 573364 573364 573664 573664 573364 ...
##   ..- attr(*, "class")= chr  "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr  "inundation" "ELEVATION" "DRAINAGE" "D_SLOPE" ...

Split the data to training set and testing set.

set.seed(3456)
trainIndex <- createDataPartition(dat_cg_norm$inundation, p = .70,
                                  list = FALSE,
                                  times = 1)
cgTrain <- dat_cg_norm[ trainIndex,]
cgTest  <- dat_cg_norm[-trainIndex,]

Build Prediction Model

Build a logistics regression using the data we selected. As the summary table shows, elevation, distance to steep slope(>8.5 degree), distance to waterbody, flow length of the watershed, percent of impervious surface, and number of broken water pipe have significant correlation with the probability of flood inundation. The level of soil drainage also has correlation with the flood inundation risk.

cgModel <- glm(inundation ~ ., 
                    family="binomial"(link="logit"), data = cgTrain %>%
                                                            as.data.frame() %>%
                                                            select(-geometry))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(cgModel)
## 
## Call:
## glm(formula = inundation ~ ., family = binomial(link = "logit"), 
##     data = cgTrain %>% as.data.frame() %>% select(-geometry))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0531  -0.1447  -0.0120  -0.0003   3.8732  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.1413     0.2175   9.845  < 2e-16 ***
## ELEVATION   -13.4032     1.0204 -13.135  < 2e-16 ***
## DRAINAGE      0.3418     0.1906   1.793 0.072999 .  
## D_SLOPE      -6.5303     0.7157  -9.124  < 2e-16 ***
## FLOWLENGTH    3.2758     0.3958   8.277  < 2e-16 ***
## IPPCT         1.1533     0.2994   3.852 0.000117 ***
## D_RIVER     -36.6293     2.4668 -14.849  < 2e-16 ***
## N_WATERBRE   -5.2181     1.3016  -4.009  6.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3791.9  on 5921  degrees of freedom
## Residual deviance: 1701.0  on 5914  degrees of freedom
## AIC: 1717
## 
## Number of Fisher Scoring iterations: 9

Model Validation

We make classProbs which is the predicted probability of a test set fishnet cell being in flood inundation on our model. Most of the the predicted probabilities of fishnet cells are between 0 to 0.1.

classProbs <- predict(cgModel, cgTest, type="response")

hist(classProbs)

Plot the predicted probability by flood inundation risk. As the image shows, for the testing set, most of the fishnet cells (highlighted in green) that are not in flood inundation areas has a predicted probability between 0 to 0.25, most of the fishnet cells (highlighted in blue) that are in flood inundation areas has a predicted probability lager than 0.45.

testProbs <- data.frame(obs = as.numeric(cgTest$inundation),
                        pred = classProbs)

ggplot(testProbs, aes(x = pred, fill=as.factor(obs))) + geom_density() +
  facet_grid(obs ~ .) + xlab("Probability") + geom_vline(xintercept = .5) +
  scale_fill_manual(values = c("darkgreen", "dodgerblue4"),
                      labels = c("Not Flood inundation","Flood inundation"),
                      name = "")

To decide which level to classify whether flood inundation or not, we choose 50%, 25%, and then create a confusion matrix, respectively. The confusion matrix of level 50% shows that the accuracy of the model is 0.9484, the balanced accuracy is 0.8134, the sensitivity is 0.6481, the specificity is 0.978. The confusion matrix of level 25% shows that the accuracy of the model is 0.9263, the balanced accuracy is 0.8977, the sensitivity is 0.8627, the specificity is 0.9327. The accuracy of level 50% is slightly higher than that of level 25%, but the balanced accuracy of level 25% is significantly higher than the balanced accuracy of level 50%. Since the number of fishnet cells labeled as flood inundation(1) much more than the number of fishnet cells labeled as no flood inundation(0), the balanced accuracy which is adjusted considering the difference of numbers of data label is more convincing than the accuracy. Additionally, failure to predict and prepare for potential future floods could result in loss of properties, income, the cost of wrongly predicting flood inundation as no flood inundation is much higher than the cost of wrongly predicting no flood inundation as flood inundation, we value the model’s ability of correctly identifying the cells being flood inundation more, so sensitivity is more important in this model compared to specificity. The prediction at level of 25% has a significantly higher sensitivity than the prediction at level of 50%, so we choose 25% as the dividing line.

testProbs$predClass  = ifelse(testProbs$pred > .5 ,1,0)

caret::confusionMatrix(reference = as.factor(testProbs$obs), 
                       data = as.factor(testProbs$predClass), 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2255   82
##          1   49  151
##                                          
##                Accuracy : 0.9484         
##                  95% CI : (0.939, 0.9567)
##     No Information Rate : 0.9082         
##     P-Value [Acc > NIR] : 2.405e-14      
##                                          
##                   Kappa : 0.6694         
##                                          
##  Mcnemar's Test P-Value : 0.005176       
##                                          
##             Sensitivity : 0.64807        
##             Specificity : 0.97873        
##          Pos Pred Value : 0.75500        
##          Neg Pred Value : 0.96491        
##              Prevalence : 0.09184        
##          Detection Rate : 0.05952        
##    Detection Prevalence : 0.07883        
##       Balanced Accuracy : 0.81340        
##                                          
##        'Positive' Class : 1              
## 
testProbs$predClass  = ifelse(testProbs$pred > .25 ,1,0)

caret::confusionMatrix(reference = as.factor(testProbs$obs), 
                       data = as.factor(testProbs$predClass), 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2149   32
##          1  155  201
##                                           
##                Accuracy : 0.9263          
##                  95% CI : (0.9154, 0.9362)
##     No Information Rate : 0.9082          
##     P-Value [Acc > NIR] : 0.0006448       
##                                           
##                   Kappa : 0.6429          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.86266         
##             Specificity : 0.93273         
##          Pos Pred Value : 0.56461         
##          Neg Pred Value : 0.98533         
##              Prevalence : 0.09184         
##          Detection Rate : 0.07923         
##    Detection Prevalence : 0.14032         
##       Balanced Accuracy : 0.89769         
##                                           
##        'Positive' Class : 1               
## 

Plot the true positives, true negatives, false negatives and false positives for the training set in Calgary. “Flood inundation” is a positive class, labeled as 1. “No flood inundation” is a negative class, labeled as 0. So we can conclude that: TP: true value = 1 and prediction = 1, TN: true value = 0 and prediction = 0, FP: true value = 0 and prediction = 1, FN: true value = 1 and prediction = 0.

cgTrain$prediction <- predict(cgModel, cgTrain, type="response")

cgTrain <- cgTrain %>%
  mutate(flood_class = case_when(
    prediction<0.25 ~ 0,
    prediction>=0.25 ~ 1
    ))

cgTrain <- cgTrain %>%
  mutate(tfpn = case_when(
  inundation==1 & flood_class==1 ~ "True_Positive",
  inundation==1 & flood_class==0 ~ "False_Negative",
  inundation==0 & flood_class==1 ~ "Flase_Positive",
  inundation==0 & flood_class==0 ~ "True_Negative"
  ))

From the map below we can see that, the model misclassify fishnet cells (highlighted in green) that are close to river and flood inundation areas as flood inundation areas, some fishnet cells (highlighted in pink) surrounded by no flood inundation areas are misclassify as no flood inundation areas.

ggplot() + 
  geom_sf(data=cgTrain, aes(fill=tfpn), colour=NA) +
  scale_fill_discrete(name = "") +
  labs(title="Prediction Accuracy by way of the fishnet") +
  mapTheme()

Plot ROC curve The ROC curve shows the trade-off between sensitivity and specificity. The ROC curve is close to the top-left corner, indicating the model has a good prediction performance.

ggplot(testProbs, aes(d = obs, m = pred)) + 
  geom_roc(n.cuts = 50, labels = FALSE) + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') 

Calculate the area under the curve, the auc is 0.9616, which is very close to 1, the prediction performance at testing set is very good.

auc(testProbs$obs, testProbs$pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9616

Cross Validation

To test the power of the model on out of sample data, we creates many randomly generated test sets or ‘folds’, testing the power of our model on each. The result shows that the average accuracy across all 100 folds is 0.9519, indicating the model has a good performance in predictin the data out of sample data.

ctrl <- trainControl(method = "cv", 
                     number = 100, 
                     savePredictions = TRUE)

cvFit <- train(as.factor(inundation) ~ .,  data = dat_cg_norm %>% 
                                                as.data.frame() %>%
                                                select(-geometry), 
               method="glm", family="binomial",
               trControl = ctrl)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cvFit
## Generalized Linear Model 
## 
## 8459 samples
##    7 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 8374, 8373, 8374, 8375, 8375, 8374, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9514134  0.7035496

Plot a histogram of accuracy across all 100 folds. The histogram shows the variability of accuracy across all 100 folds, the accuracy values are concentrated closely near 0.9, the model is generalizable to ‘out-of-sample’ data.

ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Accuracy",
       y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).

Map Prediction for Calgary

Make prediction for the entire dataset and add the prediction value to the dataset.

allPredictions <- 
  predict(cvFit, dat_cg_norm, type="prob")[,2]
  
dat_cg_norm <- 
  cbind(dat_cg_norm,allPredictions) %>%
  mutate(allPredictions = round(allPredictions * 100)) 

Map the predicted probabilities.

 ggplot() + 
    geom_sf(data=dat_cg_norm, aes(fill=allPredictions), colour=NA) +
    scale_fill_viridis_c(begin=1, end=0, name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)", ) +
  mapTheme() +
  labs(title="Flood inundation prediction, Calgary")

Classify the predictions to flood inundation and no flood inundation, the baseline is 0.25.

dat_cg_norm <- dat_cg_norm %>%
  mutate(flood_class = case_when(
    allPredictions<25 ~ 0,
    allPredictions>=25 ~ 1
    ))

Map the predictided labels.

ggplot() + 
  geom_sf(data=dat_cg_norm, aes(fill=as.factor(flood_class)), colour=NA) +
  scale_fill_manual(values = c("#FDE725", "#450D54"),
                    labels = c("No Flood Inundation","Flood Inundation"),
                    name = "") +
  labs(title="Flood inundation areas prediction in Calgary by way of the fishnet") +
  mapTheme()

Map the flood inundation areas to compare the true value and prediction value spatially. The model has a good performance in predicting flood inundation, but misclassify some of the cells that are close to flood inundation areas to flood inundation.

ggplot() + 
  geom_sf(data=dat_cg, aes(fill=as.factor(inundation)), colour=NA) +
  scale_fill_manual(values = c("#FDE725", "#450D54"),
                    labels = c("No Flood Inundation","Flood Inundation"),
                    name = "") +
  labs(title="Flood inundation areas in Calgary by way of the fishnet") +
  mapTheme()

Toronto Flood Prediction

Finally, we select Toronto as the comparable city, using the model we built based on the data of Calgary, to evaluate the prediction performance of the flood inundation model for other city.

Exploratory Analysis

We set the directory and read the data.

setwd("D:/OneDrive - PennO365/2021SpringSemester/CPLN675/HW/Midterm")
dat_tr <- st_read("./Toronto/tr_fishnet.shp")
## Reading layer `tr_fishnet' from data source `D:\OneDrive - PennO365\2021SpringSemester\CPLN675\HW\Midterm\Toronto\tr_fishnet.shp' using driver `ESRI Shapefile'
## Simple feature collection with 7452 features and 12 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 293583.8 ymin: 4826571 xmax: 335883.8 ymax: 4857171
## proj4string:    +proj=tmerc +lat_0=0 +lon_0=-79.5 +k=0.9999 +x_0=304800 +y_0=0 +ellps=clrk66 +units=m +no_defs

Check out the variables we have in Toronto dataset.

dat_tr$inundation <- as.integer(dat_tr$inundation)
str(dat_tr)
## Classes 'sf' and 'data.frame':   7452 obs. of  13 variables:
##  $ N_WATERBRE: num  0 0 0 0 0 0 5 12 3 6 ...
##  $ inundation: int  1 1 1 1 1 1 1 1 0 0 ...
##  $ ID        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ ELEVATION : num  77.5 74.5 83.4 79.4 75.3 ...
##  $ DRAINAGE  : num  30 60 60 40 40 30 40 30 60 60 ...
##  $ D_SLOPE   : num  902 1002 586 610 750 ...
##  $ IPAREA    : num  4691 236 984 14819 15984 ...
##  $ IPPCT     : num  0.05212 0.00263 0.01093 0.16466 0.17759 ...
##  $ FLOWLENGTH: num  272 378 962 804 758 ...
##  $ Shape_Leng: num  1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...
##  $ Shape_Area: num  90000 90000 90000 90000 90000 90000 90000 90000 90000 90000 ...
##  $ D_RIVER   : num  4749 4850 4375 4465 4572 ...
##  $ geometry  :sfc_POLYGON of length 7452; first list element: List of 1
##   ..$ : num [1:5, 1:2] 301084 301084 301384 301384 301084 ...
##   ..- attr(*, "class")= chr  "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr  "N_WATERBRE" "inundation" "ID" "ELEVATION" ...

Creat a dataframe with variables we’re gonna plot. Then, build some bar plots that show differences in our independent variables that are flood inundation areas and not flood inundation areas.

trPlotVariables <- 
  dat_tr %>%
  as.data.frame() %>%
  select(inundation,ELEVATION,DRAINAGE,D_SLOPE,FLOWLENGTH,IPAREA,IPPCT,D_RIVER,N_WATERBRE) %>%
  gather(key, value, ELEVATION:N_WATERBRE)

ggplot(trPlotVariables, aes(as.factor(inundation), value, fill=as.factor(inundation))) + 
     geom_bar(stat="identity") + 
     facet_wrap(~key) +
     scale_fill_manual(values = c("darkgreen", "dodgerblue4"),
                      labels = c("No Flood inundation","Flood inundation"),
                      name = "") +
    labs(x="Flood inundation", y="Value")

Scale the variables so that the differences are apparent.

temp_trPlotVariables <- 
  trPlotVariables %>%
  mutate(value = ifelse(key == "FLOWLENGTH", value/100, value)) 

temp_trPlotVariables <- 
  temp_trPlotVariables %>%
  mutate(value = ifelse(key == "IPAREA", value/100, value)) 

temp_trPlotVariables <- 
  temp_trPlotVariables %>%
  mutate(value = ifelse(key == "DRAINAGE", value*10, value)) 

temp_trPlotVariables <- 
  temp_trPlotVariables %>%
  mutate(value = ifelse(key == "IPPCT", value*1000, value)) 

temp_trPlotVariables <- 
  temp_trPlotVariables %>%
  mutate(value = ifelse(key == "N_WATERBRE", value*100, value)) 

ggplot(temp_trPlotVariables, aes(as.factor(inundation), value, fill=as.factor(inundation))) + 
     geom_bar(stat="identity") + 
     facet_wrap(~key) +
     scale_fill_manual(values = c("darkgreen", "dodgerblue4"),
                      labels = c("No Flood inundation","Flood inundation"),
                      name = "") +
    labs(x="Flood inundation", y="Value")

Data Wrangling

Select the variables used in the model we built from Calgary dataset.

dat_tr <- 
  dat_tr %>%
  select(inundation,ELEVATION,DRAINAGE,D_SLOPE,FLOWLENGTH,IPPCT,D_RIVER,N_WATERBRE) 

Feature Engineering

Plot the correlation. There is no significant correlation between the selected variables.

dat_tr.cor = cor(dat_tr %>%
                   as.data.frame() %>%
                   select(-geometry, -inundation))
corrplot(dat_tr.cor)

Normalize the features.

dat_tr_norm <- dat_tr %>% mutate_at(2:8, funs((.-min(.))/max(.-min(.))))

Map Prediction for Toronto

Make prediction for the entire dataset and add the prediction value to the dataset.

allPredictions <- 
  predict(cvFit, dat_tr_norm, type="prob")[,2]
  
dat_tr_norm <- 
  cbind(dat_tr_norm,allPredictions) %>%
  mutate(allPredictions = round(allPredictions * 100)) 

Plot the predicted probabilites.

 ggplot() + 
    geom_sf(data=dat_tr_norm, aes(fill=allPredictions), colour=NA) +
    scale_fill_viridis_c(begin=1, end=0, name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)", ) +
  mapTheme() +
  labs(title="Flood inundation prediction, Toronto")

Classify the predictions to flood inundation and no flood inundation,here we set the baseline to 25%, as same as the baseline for Calgary.

dat_tr_norm <- dat_tr_norm %>%
  mutate(flood_class = case_when(
    allPredictions<25 ~ 0,
    allPredictions>=25 ~ 1
    ))

Plot the predicted labels.

ggplot() + 
  geom_sf(data=dat_tr_norm, aes(fill=as.factor(flood_class)), colour=NA) +
  scale_fill_manual(values = c("#FDE725", "#450D54"),
                    labels   = c("No Flood Inundation","Flood Inundation"),
                    name = "") +
  labs(title="Flood inundation areas prediction in Calgary by way of the fishnet") +
  mapTheme()

Also, map flood inundation areas to evaluate the prediction performance spatially. As the prediction map and the flood inundation map show, many of the cells that are flood inundation are wrongly predicted as no flood inundation, which may leave these potential flood areas without effective protection and preparedness.

ggplot() + 
  geom_sf(data=dat_tr, aes(fill=as.factor(inundation)), colour=NA) +
  scale_fill_manual(values = c("#FDE725", "#450D54"),
                    labels = c("No Flood Inundation","Flood Inundation"),
                    name = "") +
  labs(title="Flood inundation areas in Toronto by way of the fishnet") +
  mapTheme()

To evaluate the prediction performance in Toronto dataset, we make the confusion matrix. The confusion matrix shows that the accuracy is 0.6931, the balanced accuracy is 0.5877, the specificity is significantly high (0.9784), but the sensitivity is very low, only 0.1970. The Calgary flood prediction model has a good performance at predicting the no flood inundation in Toronto, but is not good at predicting flood inundation, the model is not fit for predicting the flood inundation in Toronto.

caret::confusionMatrix(reference = as.factor(dat_tr_norm$inundation), 
                       data = as.factor(dat_tr_norm$flood_class), 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4629 2185
##          1  102  536
##                                           
##                Accuracy : 0.6931          
##                  95% CI : (0.6825, 0.7036)
##     No Information Rate : 0.6349          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2095          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.19699         
##             Specificity : 0.97844         
##          Pos Pred Value : 0.84013         
##          Neg Pred Value : 0.67934         
##              Prevalence : 0.36514         
##          Detection Rate : 0.07193         
##    Detection Prevalence : 0.08561         
##       Balanced Accuracy : 0.58771         
##                                           
##        'Positive' Class : 1               
## 

Next Step

To improve the model prediction accuracy and generalizability so that it can be used in predicting flood inundation of other cities, we’ll adujst the independent variables and test other related variables, as well as trying other maching learning methods, including random forest, Adaboosting, DNN in the future work.